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0 Speech synthesis apparatus and method. 

0 Disclosed is a method and apparatus for reading out a feature parameter and a driver sound source stored 
in a VCV (vowel-consonant-vowel) speech segment file, sequentially connecting the readout parameter and the 
readout sound source information in accordance with a predetermined rule, and supplying connected data to a 

^ speech synthesizer, thereby generating a speech output, including a memory for storing an average power of 
each vowel, and a power controller for controlling to normalize the VCV segment so that powers at both ends of 

^ each VCV segment coincide with the average power of each vowel. 
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SPEECH SYNTHESIS APPARATUS AND METHOD 



BACKGROUND OF THE INVENTION 



Field of the Invention 

5 

The present Invention relates to a rule speech synthesis apparatus and method for performing speech 
synthesis by conrecting parameters for speech segments by rules. 



70 Related Background Art 

A speech rule synthesis apparatuses available as an apparatus for generating speech from character 
train data. A feature parameter (e.g., LPC, PARCOR, LSP.or Mei Cepstrum; to be referred to as a parameter 
hereinafter) of a speech segment registered in a soeech segment file in accordance with information of 

JS character train, data is extracted and combined with a driver sound source signal (i.e., an impulse train in a 
voiced speech period and noise in a voiceless speech period) fn accordsncc with a rate for generating 
synthesized speech. A composite result is supplied to a speech synthesizer to obtain synthesized speech. 
Types of speech segments are generally, a CV (consonant-vowel) segment, a VCV (vowai-consonant-vowel) 
segment, and a CVC (consonant-vowel-consonant) segment. 

20 In order to synthesize speech segments, parameters must be interpolated. Even in interpolation 
performed when a parameter is abruptly changed, speech segments are simply connected by a line in an 
interpolation period according to a conventional technique, sc that spectral information inherent to the 
Speech segments is lost, and the resultant speech may be changed. In the conventional technique, a 
portion of speech uttered as a word or sentence is extracted as a period used as a speech segment. 

25 For this reason, depending on atmospheres wherein human speech is used as speech segments, 
speech powers greatly vary, and a gap is formed between the connected speech segments. As a resuft, 
synthesized speech sounds strange. 

In a conventional method, when speech segments are to be connected in accordance with a mora 
length changed by an utterance speed of a synthesized speech, a vowel, a consonant, and a transition 

3D portion between the vowel and consonant are not considered separately and the entire speech segment 
data is expanded/compressed (reduced) at a uniform rate. 

However, when parameters are simply expanded/reduced and connected to coincide with a syliable- 
beat-point pitch, vowels whose lengths tend to be changed with an utterance speed, phonemes IS/ and /F/, 
and explosion phonemes IP/ and W are uniformly expanded/reduced without discriminating them from each 

as other. The resultant synthesized speech is unclear and cannot be easily heard. 

Durations of Japanese syllables are almcst equal to each other. When speech segments are to be 
combined, parameters are interpolated to uniform sytiable-beat-point pitches, and the resultant synthesized 
speech rhythm becomes unnatural. 

A vowel may become voiceless depending on the preceding and following phoneme atmospheres. For 

40 example, when a word "issiki" is produced, the vowel "i" between "s" and "k" becomes voiceless. This 
can be achieved by rule synthesis in a conventional technique so that when the vowels /i/ of the syilable 
"shi" is to be synthesized, the driver sound source signal is changed into noise for synthesizing a voiceless 
sound from an impulse train for synthesizing a voiced sound without changing the parameter, thereby 
obtaining a voiceless sound. 

45 The feature parameter of the voiced sound which is to be synthesized by an impulse sound source is 
forcively synthesized by a noise sound source, and the synthesized speech becomes uf)natural. 

For example, when a rule synthesis apparatus using a VCV segment as a speech segment has six 
vowels and 25 consonants, 900 speech segments must be prepared, and a large-capacity is required. As a 
result, the apparatus becomes bulky. 

so There are three types of accent, i.e., a sfe-ongest stress start type, a strongest stress center type, and a 
fiat type. For example, each of the strongest stress start and center type accents has three magnitudes, and 
the flat type accent has two magnitudes. The accent corresponding to the input text is determined by only a 
maximunrt of three magnitudes detenmined by the accent type. A dictionary is prestored in accent 
information. 

in a conventional technique, the accent type cannot be changed at the time of te)ct input, and a desired 
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accent is difficult to output. 

A convenlionaf arrangement having no dictionary of accent information corresponding to the input text 
to input tlie text together with the accent information is available. However, tills arrangement requires 
difficult operations. It is not easy to understand rising and failing of the accent by observing only an input 
5 text. Accents of a language different from those of Japanese do not coincide with Japanese accent types 
and are different to produce. 



SUMMARY OF THE INVENTION 

70 

It is an object of the present invention to normalize a power of a speech segment using an average 
value of powers of vowels of the speech segments as a reference to assure continuity at the time of 
combination of speech segments, thereby obtaining smooth synthesized speech. 

!t is another object of the present invention to normalize a power of b speech segment by adjusting en 
IS average value of powers of vowels according to a power characteristic of a word or sentence, thereby 
obtaining synthesized speech in which accents and the like of words or sentence are more natural and 
smooth. 

It is still another object of the present invention to determine a length of a vowel from a mora length 
changed in accordance with an utterance speed so as to correspond to a phoneme characteristic, obtaining 
20 lengths of transition portions from a vowel to a consonant and from a consonant to a vowel by using the 
remaining consonants and vowels, and connecting the speech segments, thereby obtaining synthesized 
speech having a good balance of the length of time between phonemes even if the utterance speed of the 
synthesized speech is changed. 

It is still another object of the present invention to expand/reduce and connect speech segments at an 
25 expansion/reduction rate of a parameter corresponding to the type of speech segment, thereby obtaining 
high-quality .=;peech similar to a human utterance. 

It is still another object of the present Invention to synthesize speech using an exponential approxima- 
tion filter and a basic filter of a normalization orthogonal functior iiaving a larger volunfie of information in a 
low-frequency spectrum, thereby obtaining speech which can be easily understood so as to be suitable for 
so human auditory sensitivity. 

it is still another object of the present invention to l<eep a relative timing interval at the start of a vowel 
constant in accordance with the utterance speed, thereby obtaining speech suitable for the Japanese 
utterance timing. 

It is still another object of the present invention to change a parameter expansion/reduction lu rate in 
35 accordance with whether the length of the speech segment tends to be changed in accordance with a 
change in utterance speed, thereby obtaining clear high-quality speech. 

It Is still another object of the present Invention to synthesize speech by using a consonant parameter 
Immediately preceding a vowel to be converted into a voiceless sound and a noise sound source as a 
sound source to synthesize speech when the vowel Is to be converted into a voiceless sound, thereby 
40 obtaining a more natural voiceless vowel. 

It Is still anotiier object of the present invention to greatly reduce a storage amount of speech segments 
obtained such that one speech segment is inverted and connected on a time axis to use the results as a 
plurality of speech segments, thereby realizing rule synthesis using a compact apparatus. 

It is still another object of the present invention to perform time-axis conversion to use an inverted 
45 speech segment along \he time axis, thereby obtaining natural speech. 

It is still another object of the present invention to Input, together witii a text, a control character 
representing a change in accent and utterance speed at the time of text input, thereby easily changing 
desired states of the accent and utterance speed. 

so 

BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a block diagram showing a basic arrangement for performing rule speech synthesis; 
Fig. 2 is a graph showing a power gap in a VCV segment connection; 
55 Rg. 3 Is a graph showing a method of obtaining an average power value of vowels; 

Figs. 4A, 4B. and 4C are graphs showing a vowel power normalization method in a VCV segment; 
Figs. 5A, 5B, and 5C are graphs showing another vowel power normalization method In a VCV segment; 
Fig. 6 Is a graph showing a normalization method of a VCV segment by using a quadratic curve; 
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Fig. 7 is a graph sliowirtg another normalization method of a VCV segment by using a quadratic curve; 
Fig. 8 is a block diagram showing an arrangement for changing a vowel power reference value to 
perform power normalization; 

Figs. 9A to 9D are graphs showing a power normalization method by changing a vowel power reference 
5 value; 

Fig. 10 is a block diagram showing an arrangement for first determining a vowel length when a mora 
length is to be changed; 

Fig. 1 1 is a view showing a mora length, a vowel period, and a consonant period in a speech waveform; 
Fig. 12 is a graph showing a relationship between a mora length, a vowel period, and a consonant 
10 period: 

Rg. 13 is a view showing a connecting method by first determining a vowel length when the mora lengtti 
is to be ctianged; 

Fig. 14 is a block diagram showing.an arrangement for performing speech synthesis at an 
expansion/reduction rate corresponding to the type of phoneme; 
75 Fig. 15 is a block diagram showing a digital filter 5 shown in Fig. 14; 

Fig. 16 is a block diagram showing the first embodiment of one of basic filters 9 to12 in Fig. 15; 

Fig. 17 is a view showing curves obtained by separately plotting real and imaginary parts of a Fourier 

function; 

Fig. 18 is a block diagram showing an arrangement for connecting speech segments; 
20 Rg. 19 is a view showing an expansion/reduction connection of speech segments; 

Fig. 20 is a view for explaining an expansion/reduction of parameters; 

Fig. 21 is a view for further explaining parameter expansion/reduction operations; 

Fig, 22 is a view for explaining operations for connecting parameter information and label infonmation; 

Fig. 23 is a block diagram showing the second embodiment of the basic filters 9 to 12 in Fig. 15; 
35 Fig. 24 is a view showing curves obtained by separately plotting real and imaginary parts of a 

normalization orthogonal function; 

Fig. 25A is a view showing a speech waveform; 

Fig. 25B is a view showing an original parameter series; 

Fig. 25C is a view showing a parameter series for obtaining a voiceless vowel from the parameter series 
30 shown in Fig. 25B: 

Fig. 25D is a view showing a voiceless sound waveform; 

Rg. 25E is a view showing a power control funcfion; 

Fig. 25F is a view showing a power-controlied speech waveform; 

Figs. 26A and 266 are views showing a change in speech waveform when a voiceless vowel is present 
35 in a VCV segment: 

Figs. 27A and 27B are views showing an operation by using a stored speech segment in a form inverted 
along a time axis; 

Fig. 28 is a block diagram showing an arrangement in which a stored speech segment is time-inverted 
and used; 

40 Rg. 29 is a block diagram showing an arrangement for performing speech synthesis of Fig. 28 by using 
a microprocessor; 

Fig. 30 is a view showing a concept for time-inverting and using a speech segment 
Fig. 31 is a block diagram showing an an-angement for inputting a speech synthesis power control agnal 
and a text at the time of text input; 
45 Fig. 32 is a block diagram showing a detailed arrangement of a text ^alyzer shown in Rg. 31 ; 
Fig. 33 is a flow chart for setting accents; 

Rg. 34 is a flow chart for setting an utterance speed (mora length); and 

Rg. 35 is a view showing a speech synthesis power and an input text added with a power control signal. 
50 DESCRIPTION OF THE PREFERRED EMBODIMENTS 



< Interpolation by Normalization of Speech Segment> 

55 Fig. 1 is a block diagram for explaining an embodiment for interpolating a vowel gap between speech 
segment data by normalizing a power of speech segment data when the speech segment data are 
connected to each other. An arrangement of this embodiment comprises a text input means 1 for inputting 
words or sentences to be synthesized, a text analyzer 2 for analyzing an input text and decomposing the 
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text into a phoneme series and for analyzing a control code (i.e., a code for controiiing accent infornnation 
and an utterance speed) included in the input text, a parameter reader 3 for reading necessary speech 
segment parameters from phoneme series information of the text from the text analyzer 2, and a VCV 
parameter file 4 for storing VCV speech segments and speech power information thereof. The arrangement 

5 of this embodiment also Includes a pitch generator 5 for generating pitches from control information from 
the text analyzer 2, a power normalizer 6 for normalizing powers of ttie speech segments read by the 
parameter reader 5, a power normalization data memory 7 for storing a power reference value used in the 
power normalizer 6, a parameter connector 8 for connecting power-normalized speech segment data, a 
speech synthesizer 9 for forming a speech waveform fronn the connected parameter series and pitch 

70 information, and an output means 10 for outputting the speech waveform. 

In this embodiment, in order to normalize a power using an average vowei power as a reference when 
speech segments are to be connected, a standard power value for normalising the power is obtained in 
advance and must be stored in the power normalization data memory 7, and a method of obtaining and 
storing the reference value will be described below. Rg. 3 is a view showing a method of obtaining an 

75 average vowe! power. A constant period V, of a vowel V is extracted in accordance with a change in its 
power, and a feature parameter (b.-j} (1 < i < n, 1 £ j < k) is obtained. In this case, k is an analysis order and 
n is a frame count in the constant period V,. Terms representing pieces of power information are seiected 
from the feature parameters {bi,} (i.e., first-order terms in Mel Cepstrum coefficients) and are added and 
averaged along a time axis (i direction) to obtain an average value of the power terms. The above 

20 operations are performed for every vowel (an average power of even a syllabic nasal is obtained if 
necessary), and an average power of each vowel is obtained and stored in the power normalization data 
memory 7. 

Operations will be described in accordance with a data stream. A text to be analyzed is input from the 
text input means 1 . It is now assumed that a control code for controlling an accent and an utterance speed 
25 is inserted in a character such as a Roman character or a kana character. However, when a speech output 
of a sentence consisting of kanji and kana characters is to be output, a language analyzer is connected to 
the. input of the text input means 1 to convert an input sentence into a sentence consisting of kanji and kana 
characters. 

The text input from the text input means 1 is analyzed by the text analyzer 2 and Is decomposed info 
30 reading information (i.e.. phoneme series information), and information (control information) representing an 
accent position and a generation speed. The phoneme series information is input to the parameter reader 3 
and a designated speech segment parameter is read out from the VCV parameter file 4. The speech 
segment parameter input output from the parameter reader 3 is power-normalized by the power normalizer 
6. 

35 Figs. 4A and 4B are graphs for explaining a method of normalizing a vowel power in a VCV segment. 
Fig. 4A shows a change in power in the VCV data extracted from a data base, Fig. 4B shows a power 
normalization function, and Fig. 40 shows a change in power of the VCV data normalized by using the 
normalization function shown in Rg. 4B. The VCV data extracted from the data base tias variations in its 
power of the same vowel, depending on generation atmospheres. As shown in Fig. 4A, at both ends of the 

40 VCV data, gaps are formed between the average powers of the vowel stored in the power normalization 
data memory 7. The gaps (Zw and Ay) at both ends of the VCV data are measured to generate a line for 
canceling the gaps at both the ends to obtain a normalization function. More specifically, as shown in Fig. 
4B, the gaps {Ax and Ay) at both the ends are connected by a line between the VCV data to obtain the 
power normalization function. 

45 The normalization function generated in Fig. 4B is appited to original data in Rg. 4A, and adjustment is 
performed to cancel the power gaps, ttiereby obtaining the normalized VCV data shown in Fig. 4C. In this 
case, a parameter (e.g., a Met Cepstrum parameter) given as a logarithmic value can be adjusted by an 
addition or subtraction. The normalization function shown in Rg. 4B is added to or subtracted from the 
original data shown in Rg. 4A. thereby simply normalizing the original data. Figs. 4A to 4C show 

50 normalization using a lUlel Cepstrum parameter for the sake of simplicity. 

In the parameter connector 8, the VCV data pawer-normalized by the power normalizer 6 is located so 
that the mora lengths are equidrstantly arranged, and the constant period of the vowel is interpolated, 
thereby generating a parameter series. 

The pitch generator 5 generates a pitch series in accordance with the control information from the text 

ss analyzer 2. A speech waveform is generated by the synthesizer 9 using this pitch series and the parameter 
series obtained from the parameter connector 8. The synthesizer 9 is constituted by a digital filter. The 
generated speech waveform is output from the output means 10. 

This embodiment may be controlled by a p-ogram in a CPU (Central Processing Unit). 
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In the above description, one straight line Is given for one VCV data period as a normalization function 
in the power normalizer 6. However, according to this technique, a C (consonant) portion is also influenced 
by normalization, and its powor is changed. Only vowels are normalized by the following method. 

In the same manner as in normalization of one VCV data as a whole, an average power of each vowei is 
5 obtained and stored In the power normalization data memory 7. Data representing marks at the boundaries 
between the Vs (vowels) and C (consonant) in VCV data used for connection is also stored in the memory. 

Figs. 5A, 5B, and 5C are graphs for explaining normalization of only vowels in VCV data. Fig. 5A shows 
a change in power of the VCV data extracted from a data base, Rg. 5B shows a power normalization 
lunction for normalizing a power of a vowel, and Fig. 50 shows a change In power of the VCV data 
10 normalized by the normalization function. 

In the same manner as In normalization of the VCV data as a whole, gaps (Ax and Ay) between both 
ends of VCV data and the average power of each vowel are measured. As for the gap Ax, In order to cancel 
the gap in the preceding V of the VCV data, a line obtained by connecting Ax and AXO in a period A in Fig. 
5A is defined as a normalization function. Similarly, as for Ay, a line obtained by connecting the gap Ay and 
15 AYO in a period C in Rg. 5A is defined as a power normalization function in oreJer to cancel the gap in the 
range of the following V of the VCV data. No normalization function is set for the consonant in a period B. 

In order to set a power value in practice, the power normalization functions shown in Rg. 58 are applied 
to the original data in Fig. 5A in the same manner as in normalization of the VCV data as a whole, thereby 
obtaining the normalized VCV data shown in Fig. 5C. At this time, a parameter (e.g., a Mel Cepstrum 
20 parameter) given by a iogaritJimic value can be adjusted by an addition/subtraction. The normalization 
functions shown in Rg. SB are subtracted from the original data shown in Rg. 5A to simply obtan 
normalized data. Rgs. 5A to 5C exemplify a case using a Mel Cepstrum parameter for &ie sake of 
simplicity. 

As described above, the power normalization functions are obtained to cancel the gaps between the 
25 average vowel powers and the VCV data powers, and the VCV data is normalized, thereby obtaining more 
natural synthesized speech. Generation of power normalization functions is exemplified by the above two 
cases. However, *e following function may be used as a normalization function. 

Fig, 6 is a graph showing a method of generating a power normalization function In addition to the 
above two normalization functions. The nonmalization funcfion of Fig. 4B is obtained by connecting the gaps 
30 (Ax and Ay) by a line. However, in Fig. 6, a quadratic curve which is set to be zero at both ends of VCV 
data is defined as a power normalization function. The preceding and following interpolation periods of the 
VCV data are not power-adjusted by the nomDalization function. Then the gradient of the power normaliza- 
tion function is gradually decreased to zero, a change in power upon normalization can be smooth near a 
boundary between the VCV data and the average vowel power in the interpolation period. 
35 A power normalization method in this case is the same as that described with reference to the above 
embodiment, 

Rg. 7 shows a graph showing still another method of providing a power normalization function different 
from the above three normalization functions. During the periods A and C of the power normalization 
function in Fig. 4B, a quadratic curve having zero gradient at their boundaries is defined as a power 

40 nonmalization function. Since the preceding and following interpolation periods of the VCV data are not 
power-adjusted by the normalization functions, when the gradients of the power normalization functions are 
gradually decreased to zero, a change in power upon nomialization can be smooth near the boundaries 
between the VCV data and the average vowel powers in the interpolation periods. In this case, the change 
in power near the boundaries of the VCV data can be made smooth. 

46 In tills case, the power nomialization method is the same as described with reference to the above 
embodiment. 

In the above method, the average vowef power has a predetermined value in units of vowels regardless 
of connection timings of the VCV data. However, when a word or sentence is to be synthesized, a change in 
vowel power depending on positions of VCV segments can produce more natural syntiiesized speech. If a 

50 change in power is assumed to be synchronized with a change in pitch, the average vowel power (to be 
referred to as a reference value of each vowel) can be manipulated in synchronism with the pitch. In this 
case, a rise or fall ratio (to be referred to as a power characteristic) for the reference value depending on a 
pitch pattern to be added to synthesized speech is determined, and the reference value Is changed in 
accordance wi*ai this ratio, tiiereby adjusting the power. An arrangement of this technique is shown in Fig. 8, 

55 Circuit components 11 to 20 In Rg. 8 have the same functions as those of the blocks in Fig. 1 , 

The arrangement of Rg. 8 includes a power ,-eference generator 21 for changing a reference power of 
the power normalization data memory 17 in accordance with a pitch pattern generated by the pitch 
generator 15. 
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The arrangement of Fig. 6 is obtained by adding the power refererjce generator 21 to the arrangement 
of the block diagram of Fig. 1, and this circuit comporsent will be described with reference to Figs. 9A to 9D. 

Fig. 9A shows a relationship between s c'r.ange in pov/er and a power reference of each vowe! when 
VCV data is plotted along a time axis in accordance with an input pinoneme series, Fig. 9B shows a power 

5 characteristic obtained in accordance with a pitch pattern, Fig. 9C shows a reference between the power 
reference and the characteristic, and Fig. 9D shows a power obtained upon normaiization of the VCV data. 

When a sentence or word is to be uttered, the start of the sentence or word has a higher power, and 
the power is gradually reduced toward its end, This can be determined by the No. of morae representing a 
syllable count in the sentence or word, and the order of a mora having the highest power in a mora series, 

70 An accent position in a word temporarily has a high power. Therefore, it is possible to assume a power 
characteristic in accordance with a mora count of the word and its accent position. Assume that a power 
characteristic shown in Fig. 9B is given, and that a vowel reference during an interpolation period of Fig. 9A 
is corrected in accordance with this power characteristic. When a Mel Cepstrum coefficient is used, its 
parameter is given as a logarithmic value. As shown in Fig. 9C, the reference is changed by adding the 

15 correction value to or subtracting it from the reference. The changed reference is used to normalize the 
power of the VCV data of Fig. 9A, as shown in Fig. 9D. The normalization method is the same as that 
described above. 

The above normalization method may be controlled by a program in a CPU (Central Processing Unit), 

30 

< Expansion/Reduction of Speech Segment at Synthesized Speech Utterance Speed> 

Fig. 10 is a block diagram showing an arrangement for expanding/reducing speech segment at a 
synthesized speech utterance speed and for synthesizing speech. This arrangement includes a speech 
25 segment reader 31, a speech segment data file 32, a vowel length determinator 33, and a segment 
connector 34. 

The speech segment reader 31 reads speech segment data from the speech segment data file 32 in 
accordance with an input phoneme series. Note that the speech segment data is given in the form of a 
parameter. The vowel length determinator 33 determines the length of a vowel constant period in 

30 accordance with mora length information input thereto. A method of determining the length of the vowel 
constant period will be described with reference to Fig. 11. . 

VCV data has a vowel constant period length V, and a period length 0 except for the vowel constant 
period within one mora. A mora length IV! has a value changed in accordance with the utterance speed. The 
period lengths V and C are changed in accordance with a change in mora length M. When the consonant 

3S and the vowel are shortened at the same retio, the utlerarce speed is high. When a mora length is small, 
the constant can hardly be heard. The vowel period is mininnized as much as possible, and the consonant 
period is maximized as much as possible. When the utterance speed is low and the mora lengtii is large, an 
excessively long constant period causes unnatural sounding of the consonant. In this case, the consonant 
period is kept unchanged, and the vowel period is ch^ged. 

40 Changes in vowel and consonant lengths in accordance with changes in mora length are shown in Rg. 
12. The vowel length is obtained by using a formula representing the characteristic in Fig. 12 to produce 
speech which can be easily understood. Pants mi and mh are characteristic change points and given as 
fixed points. 

Formulas for obtaining V and C by the mora length are designed as follows: 
45 (1) ifM<mJl,then 

V = 1 is given, and (M - 1) is assigned to 0. 

(2) if ml ^ Mi mh.then 

V and C are changed at a given rate upon a change in M. 

(3) mh <- M. then 

50 C Is kept unchanged, and {M - C) is assigned to V. 

The above formulas are represented by the following equation: 

V + C = M 
More specifically. 

if mm ^ M i ml, then 
ss V = vm 

if nl £ M < mh, then 

V = vm + a(M - ml) 
if mh < M, then 
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V = vm + a(mh - ml) + (M - mh) 
if mm S M < mt, then 
C = (M - vm) 
if ml 5 M < mh, then 
s C = (ml - vm) + b(M - mi) 
if mil i M, then 
C = {mi - vm) + b(mti - ml) 
wliere 

a is a value satisfying condition 0 S a ^ 1 upon a ciiange in V. 
10 b is a value satisfying condition 0 ^ b £ 1 upon a change in C. 
a + b = 1 

, vm Is a minimum value of the vowel constant period iengtfi V, 
mm is a minimum value of the mora length M for vm < mm, and 
mt and mh are any values satisfying condition mm ^ mi < mh. 

75 In tlie graph shown in Fig. 12, the mora length is plotted along the abscissa, and the vowel constant 
period length V, the period length C except for the vowel constant period, a sum (V + C) (equal to the 
mora lengtii M) beitween the vowel constant period length V and tlie period length C except for the vowel 
constant period are plotted along tiie ordinate. 

By the above relations, the period length between phonemes is determined by the vowel length 

20 determinator 33 in accordance with input mora length information. Speech parameters are connected by the 
connector 34 in accordance with the determined period length. 

A connecting method is shown in Fig. 13. A waveform is exemplified in Fig. 13 for the sake of easy 
understanding. However, in practice, a connection is performed in the fonm of parameters. 

A vowei constant period length v' of a speech segment is expanded/reduced to coincide witti V. An 

ss expansion/reduction technique may be a method of expanding/reducing parameter data of the vowel 
constant period into linear data, or a method of extracting or inserting parameter data of the vowel constant 
period. A period c' except for the vowel constant period of the speech segment is expanded/reduced to 
coincide with C. An expansion/reduction method is not limited to a specific one. 

The lengths of the speech segment data are adjusted and plotted to generate synthesized speech data. 

90 The present Invention is not limited to the method described above, but various changes and modifications 
may be made. In the above method, the mora length M is divided into three parts, i.e., C, V, and C, thereby 
controlling the period lengths of the phonemes. However, the mora length M need not be divided into three 
parts, and the number of divisions of the mora length M is not limited to a specific one. Alternatively, in 
each vowei, a function or function parameter (vm, mt, mh, a, and b) may be changed to generate a function 

35 optimal for each vowel, thereby determining a period length~of each phoneme. 

In the case of Fig. 13, the syllable beat point pitch of the speech segment waveform is equal to that of 
the synthesized speech. However, since the syllable beat point pitch is changed in accordance with ttie 
utterance speed of the synthesized speech, the values v, and V and the values c' and C are also 
simultaneously changed. 



<Speech Synthesis Apparalus> 

An important basic arrangement for speech synthesis is shown in Fig. 14. 

45 A speech synthesis apparatus in Fig. 14 includes a sound source generator 41 for generating noise or 
an impulse, a rhythm generator 42 for analyzing a rhythm from an Input character train and giving a pitch of 
the sound source generator 41, a parameter controller 43 for determining a VCV parameter and an 
interpolation operation from the input character train, an adjuster 44 for adjusting an amplitude level, a 
digita! filter 45, a parameter buffer 46 for storing parameters for the digital filter 45, a parameter interpolator 

50 47 for interpolating VCV parameters with the parameter buffer 46, and a VCV parameter file 48 for storing 
all VCV parameters. Fig. 1 5 is a block diagram showing an arrangement of the digital filter 45. The digital 
filter 45 comprises basic filters 49 to 52. Fig. 16 is a block diagram showing an arrangement of one of Sie 
basic filters 49 to 52 shown in Fig. 15. 

In this embodiment, the basic filter shown in Fig. 16 comprises a discrete filter for performing synthesis 

55 using a nomnalization orthogonal function developed by the following equation: 
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P + jti 



When this filter is combined witii an exponential function approximation filter, the real number of each 
nornrialization orthogonal function represents a logarithmic spectral characteristic. Fig. 17 shows curves 
obtained by separately plotting the real and imaginary parts of the nonnnajization orthogonal function. 
Judging from Fig. 17, It is apparent that the orthogonal system has a fine characteristic In the low-frequency 
range and a coarse characteristic in the high-frequency range. A parameter C„ of this synthesizer is 
obtained as a Fourier-transformed value of a frequency-converted iogarithmic spectrum. When frequency 
conversion is approximated in a Mel unit, It is called a IVIel Cepstrum. In this embodiment, frequency 
conversion need not always be approximated in the Mel unit- 

A delay free loop is eliminated irom the filter shown in Fig. 16, and a filter coefficient bn can be derived 
from the parameter C„ as follows: 

2 

P 

T 

a = 

2 

P + - 
T 



Under this condition, bl^i + l = 2aCN bn = C„ + a{2G„.y • b„*,) for 2 ^ n N 
b, = (2C, - ab2)/(l - -^a) 
bo = Co - abi 

30 A processing flow in Fig. 14 will be described in detail below. 

A character train is input to the rhythm generator 42, and pitch data P{t) is output from the rhythm 
generator 42. The sound source generator 41 generates noise in a voiceless period and an impulse in a 
voiced period. At the same time, the character train is also input to the parameter controller 43, so that the 
types of VCV parameter and an interpolation operation are determined. The VCV parameters determined by 

35 the parameter controller 43 are read out from the VCV parameter file 48 and connected by the parameter 
interpolator 47 in accordance with the interpolation method detenmined by the parameter controller 43. The 
connected parameters are stored in the parameter buffer 46. The parameter interpolator 47 performs 
interpolation of parameters betv/een the vowels when VCV parameters are to be connected. Since the 
parameter has a fine characteristic in the low-frequency range and a coarse characteristic in the high- 

'to frequency range, and since the logarithmic spectrum is represented by a linear sum of parameters, linear 
interpolation can be appropriately performed, thus minimizing distortions. The parameter stored in the 
parameter buffer 46 is divided into a portion containing a nondelay component (bo) and a portion containing 
delay components (bi, ba.-. bn+i-). The former component is input to the amplitude level adjuster 44 so 
that an output from the sound source generator 41 is multiplied vnth exp(bo). 

45 



<Expanslon/Reduction of Parametet> 



Fig. 18 is a blocl< diagram showing an arrangement for practicing a method of changing an 
50 expansion/reduction ratio of speech segments in correspondence with types of speech segments upon a 
change in utterance speed of the synthesized speech when speech segments are to be connected. This 
arrangement includes a character series input 101 for receiving a character series. For example, when 
speech to be synthesized is /on sei/ {which means speech), a character train "OnSEI" is input. 

A VCV series generator 102 converts the character train input from the character series input 101 into a 
55 VCV series, e.g.. "QO, On, nSE, El, IQ". 

A \ CV parameter memory 103 stores V (vowel) and CV parameters as VCV parameter segment data or 
word start or end data corresponding to each VCV of the VCV series generated by the VCV series 
generator 102. 
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A VCV label memory 104 stores acoustic boundary discrimination labels {e.g., a vowel start, a voiced 
period, a voiceless period, and a syllable beat point of each VCV parameter segment stored in the VCV 
parameter memory 103) together with their position data. 

A syllable beat point pitch setting means 105 sets a syllable beat point pitch in accordance with an 
5 utterance speed of synthesized speech. A vowel constant lerigth setting means 106 sets the length of a 
constant period of a vowel associated with connection of VCV parameters in accordance with the syllable 
beat point pitch set by the syllable beat point pitch setting means 105 and the type of vowel. 

A parameter expansion/reduction rate setting means 107 sets an expansion/reduction rate for 
expanding/reducing VCV parameters stored in the VCV parameter memory 103 in accordance with the 
10 types of labels stored in the VCV label memory 104 in such a manner that a larger expansion/reduction rate 
is given to a vowel, /SI. and /F/, the lengths of which tend to be changed in accordance with a change in 
utterance speed, and a smaller expansion/reduction rate is given to an exp/osive consonant such as /P/ and 

m. 

A VCV EXP/RED connector 108 reads out from the VCV parameter memory 103 parameters cor- 
75 responding to the VCV series generated by the VCV series generator 102, and reads out the corresponding 
labels from the VCV label memory 104. An expansion/reduction rate is assigned to the parameters by the 
parameter EXP/RED rate setting means 107, and the lengths of the vowels associated with the connection 
are set by the vowel consonant length setting means 106. The parameters are expanded/reduced and 
connected to coincide with a syllable beat point pitch set by the syllable beat point pitch setting means 105 
20 in accordance with a method to be described later with reference to Fig. 19. 

A pitch pattern generator 109 generates a pitch pattern in accordance with accent informafion for the 
character train input by the character series input 101 . 

A driver sound source 1 1 0 generates a sound source signal such as an impulse train. 
A speech synthesizer 111 sequentially synthesizes the VCV parameters output from the VCV EXP/RED 
25 connector 108, the pitch patterns output from the pitch pattern generator 109, and the driver sound sources 
output from the driver sound source 110 In accordance with predetermined rules and outputs synthesize 
speech. 

Rg. 19 is ar^ operation tor expanding/reducing and connecting VCV parameters as speech segments. 
(A1) shows part of an utterance of "ASA" (which means morning) in a speech waveform file prior to 
30 extraction of the VCC segment, (A2) shows part of an utterance of "ASA" in the speech waveform file 
prior to extraction of the VCV segment. 

(B1) shows a conversion result of waveform information shown in (A!) into parameters. (Sj) shows a 
conversion result of the waveform information of (A2} into parameters, these parameters are stored in the 
VCV parameter memory 103 in Rg. 14. (Bs) shows an interpolation result of spectral parameter data 
35 interpolated between the parameters. The spectral parameter data has a length set by a syllable t>eat 
point pitch and types of vowels associated with the connection. 

(CI) shows an acoustic parameter boundary position represented by label information corresponding to 
(Al) and (B1). {C2) shows an acoustic parameter boundary position represented by label information 
corresponding to (A2} and (B2). These pieces of label information are stored in the VCV label memory 
40 104 in Fig. 14. Note that a label "?" corresponds to a syllable beat point. 

(D) shows parameters connected after pieces of parameter information' corresponding to a portion from 
the syllable beat point of (CI) to the syllable beat point of (C2) are extracted from (Bl), (B3), and (B2). 

(E) shows label information corresponding to (D). 

(F) shov/s an expansion/reduction rate set by the types of adjacent labels and represents a relative 
45 measure used when the parameters are expanded or reduced in accordance with the syllable beat point 

pitch of the synthesized speech. 

(G) shows parameters expanded/reduced in accordance with the syllable beat point pitch. These 
parameters are sequentially generated and connected In accordance with the VCV series of speech to 

be synthesized. 

60 (H) shows label information corresponding to (6). These pieces of label information are sequentially 
generated -and connected in accordance with the VCV series of the speech to be synthesized. 
Fig. 20 shows parameters before and after they are expanded/reduced so as to explain an 
expansion/reduction operation of the parameter. In this case, the corresponding labels, the 
expansion/reduction rate of the parameters between the labels, and the length of the parameter after it is 
65 expanded/reduced are predetermined. More specifically, the label count is (n + 1), a hatched portion in Fig. 
20 represents a labeled frame, si (1 ^ i ^ n) is a pitch betwem labels before expansion/reduction, ei (1 S i S 
n) is an expansion/reduction rate, di (1 S i ^ n) is a pitch between labels after expansion/reduction, and dO is 
the length of a parameter after expansion/reduction. 



11 



EP 0 427 485 A2 



A pitch di which satisfies the following relation is obtained: 
dl - si di - si 



si si 
= el : ... ; ei : ... : en 

dl + ... + di + ... + dn = dO 



Parameters corresponding to si (1 S i S n) are expanded/reduced to the lengths of di and are 
sequentially connected. 

Fig. 21 is a view for further explairiing a parameter expansion/reduction operation and shows a 
parameter before and after expansion/reduction. In this case, the lengths of the parameters before and after 
expansion/reduction are predetermined. More specifically, k is the order of each parameter, s is the length 
of the parameter before expansion/reduction, and d is the length of the parameter after expansion/reduction. 

The jth {1 S j ^ d) frame of the parameter after expansion/reduction is obtained by the following 
sequence. 

A value X defined by the following equation is calculated: 

i = l " 

If the value x Is an integer, the xth frame before expansion/reduction is inserted in the jth frame position 
after expansion/reduction. Othenwise, a maximum integer which does not exceed x is defined as i, and a 
result obtained by weighting and averaging the ith frame before expansion/reduction and the (t + 1)th frame 
before expansion/reduction to the (x - 1) vs. (1 - x + i) is inserted into fhe jth frame position after 
expansion/reduction. 

The above operation is performed for ail the values j, and the parameter after expansion/reduction can 
be obtained, 

Fig. 22 is a view for explaining an operation for sequentially generating and connecting parameter 
information and label infomnation in accordance with the VCV series of the speech to be synthesized. For 
example, speech "OnSEl" (which means speech) Is to be synthesized. 

The speech "OnSEI" is segmented Into five VCV phoneme series fQO/, /On/, /nSE/. /El/, and /IQ/ where 
Q represents silence. 

The parameter information and the label information of the first phoneme series /QO/ are read out, and 
the pieces of information up to the first syllable beat point are stored in an output buffer. 

In the processing described with reference to Figs. 15, 16, and 17, four pieces of parameter information 
and four pieces of label information are added and connected to the stored pieces of information in the 
output buffer. Note that connections are performed so tiiat the frames corresponding to the syllable beat 
points (label "?") are superposed on each other. 

The above, operations have been described Vw'ith reference to speech synthesis by a Fourier circuit 
network using VCV data as speech segments. Another method for performing speech synthesis by an 
exponential function filter using VCV data as speech segments will be described below. 

An overali arrangement for performing speech synthesis using the exponentiai function filter, and an 
arrangement of a digital fitter 45 are the sane as those in the Fourier circuit network. These arrangements 
have been described with reference to Figs. 1 and 15, and a detailed description thereof will be omitted. 

Fig. 23 shows an arrangement of one of basic filters 49 to 52 shown in Fig. 15. Fig. 24 shows curves 
obtained by separately plotting the real and imaginary parts of a normalization orthogonal function. 

In this embodiment, tlie normalization orthogonai function is developed as follows: 
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U2(w ) = \^ 



P + j(0 
P - jii) 



03(0)) = \rBp 



P + jw 2P + jO) 
P - jw 2P - jcj 



P + }{0 2P + joj 3P + ju 



The abo\^e function is realized by a discrete filter using bilinear conversion as the basic filter shown in 
jg Fig, 23. Judging from the characteristic curves in Fig. 24, the orthogonal system has a fine characteristic in 
the low-frequency range and a coarse characteristic in the high-frequency range. 

A delay free loop is eliminated from this filter, and a filter coefficient bn can be derived from On as 
follows: 

bN+I = 2(1 -P„)Kjj - 



^ 2K^-lCn-l} 
for 2 < n 5 N 



JPl(l + Pq) f b2 \ . 



np 

Pn = 



2 

np + — 
T 



where T Is the sample period. 



V'Snp 
Kn = 



2 

np - - 
T 
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When speech synthesis is to be performed using this exponentlaJ function filter, operations in Fig. 14 
and a method of connecting the speech segments are the same as those in the Fourier circuit network, and 
a detailed description thereof will be omitted. 

In the above description, development of the system function is exemplified by the normalization 
s orthogonal systems of the Fourier function and the exponential function. However, any function except for 
the Fourier or exponential function may be used if the function is a normalization orthogonal function which 
has a larger volume of information in the low-frequency specfrum, 



TO <Voiceless Vowel> 

Figs. 25A to 25F are views showing a case wherein a voiceless vowel is synthesized as natural speech. 
Fig. 25A shows speech segment data including a voiceless speech period, Fig. 25B shows a parameter 
series of a speech segment, Fig. 25C shows a parameter series obtained by substituting a parameter of a 
15 voiceless portion of the vowel with a parameter series of the Immediately preceding consonant, Fig. 25D 
shows the resultant voiceless speech segment data, Fig. 25E shows a power control function of the 
voiceless speech segment data, and Fig. 25F shows a power-controlled voiceless speech waveform. A 
method of producing a voiceless vowel will be described with reference to the accompanying drawings. 
Conditions for producing a voiceless vowel are given as follows: 
20 (1) Voiceless vowels are limited to /i/ and /u/. 

(2) A consonant immediately precedin.g a voiceless vowel is one of silent fricative sounds /s/. /h/, Id, and 
/f/, and explosive sounds /p/, IV, and M 

(3) When a consonant follows a voiceless vowel, the consonant is one of explosive sounds /p/, M and Jkl. 
When the above three conditions are satisfied, a voiceless vowel is produced. However, when a vowel 

35 is present at thfi end of a word, a voiceless vowel is produced when conditions (1) and (2) are satisfied. 

When a voiceless vowel is determined to be produced in accordance with the above conditions, speech 
segment data including voiceless vowel (in practice, a feature parameter series (Fig. 25B) obtained by 
analyzing speech) is extracted from the data base. At this time, the speech segment data is labeled with 
acoustic boundary infomiation. as shown in Rg. 25A. Data representing a period from the start of vowel to 

30 the end of vowel is changed to data of consonant constant period C from the label information. As a method 
for this, a parameter of the consonant constant period C is linearly expanded to the end of the vowel to 
insert a consonant paramet^ in the period V, as shown in Fig. 25C. A sound source for the period V is 
determined to select a noise sound source. 

If power control is required to prevent fonmatlon of power gaps upon connection of the speech 

35 segments and production of a strange sound, a power control characteristic correction function having a 
zero value near the end of silence is set and applied to the power tenm of the parameter, thereby 
performing power control, as shown in Fig. 25D. When the coefficient is a Mel Cepstrum coefficient, its 
parameter is represented by a logarithmic value. The power characteristic correction function is subtracted 
from the power tenm to control power control. 

40 The method of producing a voiceless vowel when a speech segment is given cis CV (consonant-vowel) 
segment has been described above. However, the above operation is not limited to any specific speech 
segment, e.g., the CV segment When the speech segment is larger than a CV segment, (e.g., a CVC 
segment in this case, a consonant is connected to ^e vowel, or the consonants are to be connected to 
each other), a voiceless vowel can be obtained in the same method as described above. 

45 An operation perfomned when a speech segment is given as a VCV segment {vowel-consonant-vowel 
segment), that is, when the vowels are connected at the time of speech segment ccmnection, will be 
described with reference to Rgs. 26A and 26B. 

Rg. 26A shows a VCV segment including a voiceless period, and Fig. 26B shows a speech waveform 
for obtaining a voiceless portion of a speech period V. 

50 This op>er3tion will be described with reference to Figs. 26A and 26B. Speech segment data.is extracted 
from the data base. When connection is performed uang a VCV segment, vowel constant periods of the 
preceding VCV segment and the following VCV segment are generally interpolated to perform the 
connection, as shown in Rg. 26A. In this case, when a voiceless vowel is to be produced, a vowel between 
the preceding and following VCV segments is produced as a voiceless vowel. The VCV segment is located 

55 in accordance with a mora position. As shown in Rg. 26B, data of the vowel period V from tiie start of the 
vowel after the pr-jcecfing VCV segment to the end of the vowel before the following VCV segment is 
changed to data of tine consonant constant pa-iod C of the preceding VCV segment. As this method has 
been described in the first embodiment, the parameter of the consonant constant period C is linearly 



EP 0 427 485 A2 



expanded to the end of vowel, and the sound source is given as a noise sound source to obtain a vdceless 
vowel period. If power control is required, the power can be controlled by the metiiod described with 
reference to Rg. 1 . 

The voiceless vowel described above can be obtained in th© arrangenient shown in Fig. 1. The 
5 arrangement of Fig. 1 has been described before, and a detailed description thereof wilt be omitted. 

A method of synthesizing phonemes to obtain a voiceless vowel as natural speech is not linnited to the 
above method, but various changes and modifications may be made. For example, when a parameter of a 
vowel period is to be changed to a parameter of a consonant period, the constant period of the consonant is 
linearly expanded to the end of the vowel in the above method. However, the parameter of the consonant 
70 constant perbd may be partially copied to tie vowel period, thereby substituting the parameters. 



< Storage of Speech Segment> 

75 Necessary VCV segments must be prestored to generate a speech par^eter series in order to 
perfonm speech synthesis. When all VCV combinations are stored, a memory capacity becomes very large. 
Various VCV segments can be generated from one VCV segment by time inversion and time-axis 
conversion, thereby reducing the number of VCV segments stored in the memory. For example, as show in 
Fig. 27A, the number of VCV segments can be reduced. More spectficaliy, a W pattern is produced when a 

20 vowel chain is given in a VCV character train. Since the vowel chain is generally symmetrical about the fime 
axis, the time axis is inverted to generate another pattern. As shown in Fig. 27A, an /At/ pattern can be 
obtained by invertir>g an /lA/ pattern, and vice versa. Therefore, only one of flie /lA/ and /Al/ patterns is 
stored. Fig. 27B shows an utterance "NAGANO" (ttie name of place in Japan). An /ANO/ pattern can be 
produced by inverting an ZONA/ pattern. However, in a VCV pattern including a nasal sound has a start 

ss duration of the nasal sound different from its end duration. In this case, time-axis conversion is performed 
using an appn:^)riate lime conversion function. An /AGA/ pattern is obtained such that an /AQ/V pattern as a 
VCV pattenn is obtained by time-inverting and connecting the /AG/or /GA/ pattern, and then the start 
duration of the nasal component and the end duration of the nasal component are adjusted with each other. 
Time-axis conver^on is perfomned in accordance with a table look-up system in which a time conversion 

30 function is obtained by DP and is stored in the form of a table in a memory. When time conversion is linear, 
linear function paramet«-s may be stored and linear function calculations may be performed to covert the 
time axis. 

Fig. 28 is a block diagram showing a speech synthesis arrangement using data obtained by time 
inversion and time-axis conversion of VCV data prestored in a memory. 

35 Referring to Fig. 28, this arrangement Includes a text analyzer 61, a sound source controller 62, a sound 
source generator 63, an inpulse source generator 64, a noise source generator 65, a mora connector 66, a 
VCV data memory 67, a VCV data inverter 68, a time axis converter 69, a speech synthesizer 70 including 
a synthesis filter, a speech output 71, and a speaker 72. 

Speech synthesis processing in Fig. 28 will be described below. A text represented by a character train 

40 for speech synthesis is analyzed by the text analyzer 61, so that changeover between voiced and voiceless 
sounds, high and low pitches, a change in connection time, and an order of VCV connections are extracted. 
Information associated with the sound source (e.g., changeover between voiced and voiceless sounds, and 
the high and low pitches) is sent to the sound source controller 62. The sound source controller 62 
generates a code for controlling the sound source generator 63 on the basis of the input information. The 

45 sound source generator 63 comprises the impulse source generator 64, the noise source generator 65, and 
a switch for switching between the impulse and noise source generators 64 and 65. The impulse source 
generator 64 is used as a sound source for voiced sounds. An impulse pitch is controlled by a pitch control 
code sent from the sound source controller 62. The noise source generator 65 is used as a voiceless sound 
source. These two sound sources are switched by a voicedA^oiceiess switdiing control code sent from the 

so sound source controller 62. The mora connector 66 reads out VCV data from the VCV data memory 67 and 
connects them on the basis of VCV connection data obtained by the text analyzer 61. Connection 
procedures will be described below. 

The VCV data are stored as a speech parameter series of a higher order such as a mel cepstrum 
parameter series in the VCV data memory 67. In addition to the speech parameters, the VCV data memory 

55 67 also stores VCV pattern names using phoneme marks, a flag representing whether inversion data is used 
{when the inversion data is used, the flag is set at "1 otherwise, it is set at "0"), and a CVC pattern name 
of a VCV pattern used when the inversion data is to be used. The VCV data memory 67 further stores a 
time-axis conversion flag for determining whether tiie time axis is converted (when the time axis is 
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converted, the flag is set at "1 "; otherwise, it is set at "0" , and addresses representing the time conversion 
function or table. When a VCV pattern is to be read out. and the inversion flag is set at "1", an inversion 
VCV pattern is sent to the VCV inverter 68, and the VCV pattern is inverted aiong the time axis. If the 
inversion flag is set at "0", the VCV pattern is not supplied to the VCV inverter 68. If the time axis 

s conversion flag is set at "1 the time axis is converted by the time axis converter 69. Time axis conversion 
can be performed by a table look-up system using a conversion table for storing conversion function 
parameters, thereby performing time axis conversion by function operations. The mora connector 66 
connects VCV data output from the VCV data memory 67, the VCV inverter 68, and the time axis converter 
69 on the basis of mora connection information. 

10 A speech parameter series obtained by VCV connections in the mora connector 66 is synthesized with 
the sound source parameter series output from the sound source generator 63 by the speech synthesizer 
70. The synthesized result is sent to the speech output 71 and is produced as a sound from the speaker 
72. 

An arrangement for performing the above processing by using a microprocessor wll! be described with 
16 reference to Rg. 29 below. 

Referring to Fig. 29, this anrangement includes an interface (l/F) 73 for sending a text onto a bus, a 
read-only memory (ROM) 74 for storing programs and VCV data, a buffer random access memory (RAM) 
75, a direct memory access controller (DMA) 76, a speech synthesizer 77, a speech output 78 comprising a 
filter and an amplifier, a speaker 79, and a processor 80 for controlling the overall operations of the 
20 arrangement. 

Ttie text is temporarily stored in the RAM 75 through the interface 73. This text is processed in 
accordance with the programs stored ir the ROM 74 and is added with a VCV connection code and a 
sound source control code. The resultant text is stored again in the RAM 75. The stored data is sent to the 
speech synthesizer 77 through the DMA 76 and is converted into speech with a pitch. The speech with a 

25 pitch is output as a sound from the speaker 79 fhrougii the speech output 78. The above control Is 
performed by the processor 80. 

(n the above description, the VCV parameter series is exemplified by the Mel Cepstrum parameter 
series. However, another parameter series such as a PARCOR, LSP, and LPS Cepstrum parameter series 
may be used in place of the Mel Cepstrum parameter series. The VCV segment Is exemplified as a speech 

30 segment. However, other segments such as a CVC segment may be similarly processed. In addition, when 
a speech output is generated by a combination of CV and VC segments, the CV pattern may be generated 
from the VC pattern, and vice versa. 

When a speech segment is to be inverted, the inverter need not be additionally provided. As shown in 
Fig. 30, a technique for assigning a pointer at the end of a speech segment and reading it from the reverse 

35 direction may be employed. 



<Text input> 

40 The following embodiment exemplifies a method of synthesizing speech with a desired accent by 
inputting a speech accent control mark together with a character train when a text to be synthe^zed as 
speech is input as a character train. 

Fig. 31 is a block diagram showing an arrangement of this embodiment. This arrangement includes a 
text analyzer 81 , a parameter connector 82. a.pitch generator 83, ^d a speecfi signal generator 84. An 

45 input text consisting of Roman characters and control characters is exb-acted in units of VCV segments (i.e., 
speech segments) by the text analyzer 81. The VCV parameters stored as Mel Cepstrum parameters are 
expanded/reduced and connected by the parameter connector 82, thereby obtaining speech parameters. A 
pitch pattern is added to this speech parameter by the pitch generator 83. The resultant data is sent to the 
speech signal generator 84 and is output as a speech signal. 

so Rg. 32 is a block diagram lowing a detailed arrangement of the text analyzer 81. The type of 
character of the input text is discriminated by a character sort discriminator 91. if the discriminated 
character is a mora segmentation character (e.g.. a vowel, a syllabic nasal sound, a long vowel, or a double 
consonant), a VCV table 92 for storing VCV segment paroneters accessible by VCV Nos. in a VCV No. 
getting means 93 is accessed, and a VCV No. is set in tfie input text analysis output data. A VCV type 

ss setting means 94 sets a VCV type (e.9., voiced/voiceless, long vowel/double consonant, silence, word 
startMord end, double v( wel, sentence end) so as to correspond to the VGV No. extracted by the VCV No. 
getting means 93. A presumed syllable beat point setting means 95 sets a presumed syllable beat point, 
and a phrase setting means 97 sets a phrase (breather). 
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This embodiment is associated with setting of an accent and a presumed syllabic beat point in the input 
analyzer 81 . The accent and the presumed syllabic beat point are set in units of morae and are sent to the 
pitch generator 83. When the accent is set by the input text, for example, when a Tokyo dialogue is to be 
set, an input "hashi" (which means a bridge is described as "HA/SHI", and an input "hash!" (which means 
5 chopsticks) is described as "/HA SHI", Accent control is performed by control marks "/" and "\". The 
accent is raised by one level by the mark "/", and the accent is lowered by one levei by the mark "\". 
Similarly, the accent is raised by two levels by the marks "//", and the accent is raised by one level by the 
marks "/A" or "A/". 

Rg. 33 is a flow chart for setting an accent The mora No. and the accent are iniiialized (Sa i ). An input 
10 text is read ciiaracter by character (832), and the character sort is determined (S33). If an input character is 
an accent control mark, it is determined wheltier it is an accent raising mark or an accent lowering mark 
(S34). If it is determined to be an accent raising mark, the accent is raised by one levei (836). However, if it 
is determined to be an accent lowering mark, the accent is lowered by one level (337), If the input character 
is determined not to be an accent control mark (833), it is determined whether it is a character at the end of 
IS the sentence (S35). if YES in step S35, the processing is ended. Otherwise, the accent is set in the VCV 
data (S3d). 

A processing sequence will be described with reference to the flow chart shown in Fig. 33 wherein an 
output of the text analyzer is generated when an input text "KO//RE\WA //PE\NVDE\SU/KA/." is entered. The 
accent is initialized to 0 (331). 

20 A character "K" is input (332) and its character sort is determined by the character sort discriminator 
91 (333). The character "K" is neither a control mark nor a mora segmentation character and is then stored 
in the VCV buffer. A character "0" is neither a control mark nor a mora segmentation character and is 
stored in the VCV buffer. The VCV No. getting means 93 accesses the VCV table 92 by using the character 
train "KO" as a key in the VCV buffer (838). An accent value of 0 is set in the text analyzer output data in 

25 response to the input "KO", the VCV buffer is cleared to zero in the VCV buffer (831). A character "/" is 
then input to the VCV buffer, and its type is discriminated (833). Since the character 7" is an accent raising 
control mark (S34). the accent value is incremented by one (S36). Another character 7" is input to further 
Increment the accent value by one (S36), thereby setting the accent value to 2. A character "R" is input and 
its character type is discriminated and stored in the VCV buffer. A character "E" is then input and its 

30 character type is discriminated. TTie character "E" is a Roman character and a segmentation character, so 
that it is stored in the VCV buffer. The VCV table is accessed using the character train "ORE" as a key in 
the VCV buffer, thereby accessing the corresponding VCV No, The input text analyzer output data 
corresponding to the character train "ORE" is set together with the accent value of 2 (S38). The VCV buffer 
is then cleared, and a character "E" is stored In the VCV buffer. A character "\" is then input (S32) and its 

35 character type is discriminated (S33). Since the character "\" is an accent lowering control mark (834), the 
accent value is decremented by one (837), so that the accent value is set to be 1 . The same processing as 
described above is performed, and the accent value of 1 of the input text analyzer output data "EWA" is 
set When (n + 1 spaces are counted as n morae, the input "KO/RE\WAV/PE\NVDE\SU/KA/." can be 
decomposed into morae as follows: ~ 

40 "KO" + "ORE" + "EWA" + "A" + "PE" + "EN" + "NDE" + "ESU" + "UKA" + "A" 
and the accent values of Oie respective morae are set within the parentheses: 

"KO (0)" + "ORE (2)" + "EWA (1)" + "A (0)" + "PE (2)" + "EN (1)" + "NDE (1)" + "ESU (0)" + 

"UKA(1)" + "A (2)" 

The resultant mora series Is input to the pitch generator 83, thereby generating tiie accent components 

45 shown in Fig. 35. 

Fig. 34 is a flow chart tor setting an utterance speed. 

Control of the mora pitch an the utterance speed is performed by control marks "-" an " + " in the same 
manner as accent control. The syllable beat point pitch is decremented by one by the mark "-" to increase 
the utterance speed. The syllable beat point pitch is incremented by one by the mark " + " to decrease the 
50 utterance speed. 

A character train input to the text analyzer 81 is extracted in units of morae, and a syllable beat point 
and a syllable beat point pitch are added to each mora. The resultant data is sent to the parameter 
connector 82 and the pitch generator 83. 

The syllable beat point is initialized to be 0 (msec), and the syllable beat point pitch is initialized to be 
55 96 (160 msec). 

When an input "A + lU-E-O" is entered, the input is extracted in units of norae. A presumed syllable 
beat point position (represented by brackets []) serving as a reference before a change is added by an 
utterance speed control code, and the next input text analyzer output data is generated as follows; "A [16]" 



17 



EP 0 427 485 A2 



+ "Al [33]" + "lU [50]" + "UE [65]" + "EO [79]" + "0 [94]" 

Setting of an utterance speed (mora pitch) will be described with reference to a flow chart in Fig. 34. 

The syllable beat point is initialized to be 0 (msec), and the presumed syllable beat point is initialized to 
be 96 (160 msec) (S41). A text consisting of Roman letters and control marks is input (S42), and the input 
5 text is read character by character in the character type discriminator 91 to discriminate the character type 
or sort (S43). If an input character is a mora pitch control mark (S43). it is determined whether it is a 
deceleration or acceleration mark (S**). If the character is determined to be the deceleration mark, the 
syllable beat point pitch is incremented by one (S46). However, if the input character is determined to be 
the acceleration mark, the syllable beat point pitch is decremented by one (S47). When the syllable beat 
70 point pitch is changed (S48 and S47), the next one character is input from the input text to the character 
sort discriminator 91 {S42). When the character type is determined not to be a mora pitch control mark in 
step S43, it is determined to be located at the end of the sentence (S45). If NO in step S45, the VCV data 
is set without changing the presumed syllable beat point pitch (S48). However, if YES in step S45, the 
processing is ended. 

75 When the syllable beat point pitch is changed in processing for setting the utterance speed, the position 
of the presumed syllable beat point is also changed. 

Processing for the accent and speed change is performed in the CPU (Central Processing Unit), 
In the foregoing, the word "mora " has the meaning required by the context, and includes but is not 
limrted to meaning the duration of a short syllable. Ihe words "vowel" and "consonant" do not imply any 
20 particular linguistic model or group of (wiguages; the invention is applicable in general to groups of parts of 
speech and transitions therebetween, as will be understood from the foregoing. The word "voiceless" will 
be understood to mean "unvoiced". 



25 Claims 

1. A speech synthesis apparatus for reading out a feature parameter and a driver sound source stored in a 
VCV (vowel-consonant- vowel speech segment file, sequentially connecting the readout parameter and the 
readout sound source information in accordance with a predetermined rule, and supplying connected data 

30 to a speech synthesizer, thereby generating a speech output, comprising: 
memory means for storing an average power of each vowel; and 

power control means for controlling to normalize the VCV segment so tliat powers at both ends of each 
VCV segment coincide witii the average power of each vowel. 

2. An apparatus according to claim 1 , wherein said power control means normalizes the VCV segment as a 
36 whole. 

3. An apparatus according to claim 1 , wherein sard power control means normalizes only a vowel of the 
VCV segment 

4. An apparatus according to claim 1 , wherein said power control means adjusts the average power of each 
vowel In accordance with a power characteristic of a word or sentence and nonmalizes the power of the 

40 VCV segment. 

5. A method of reading out a feature parameter and a driver sound source registered in a VCV (vowel- 
consonant-vowel) speech segment file in accordance with a phoneme series of speech to be synthesized, 
sequentially connecting the readout parameter and the readout sound source information in accordance with 
& predetermined rule, and supplying ihe connected data to a speech synthesizer, thereby generating a 

45 speech output, comprising the steps of: 

prestoring an average power of each vowel; and 

normalizing the VCV segment so that powers at both ends of each VCV segment coincide with the average 
power of each vowel. 

6. A method according to claim 5, wherein the step of normalizing the power of the VCV segment 
50 comprises performing normalization of the VCV segment as a whole. 

7. A method according to claim 5, wherein the step of normalizing the power of tiie VCV segment 
comprises performing normalization of only a vowel of the VCV segment. 

8. A method according to claim 5, wherein tiie step of normalizing the power of the VCV segment 
comprises adjusting tfie average power of each vowel in accordance with a power characteristic of a word 

66 or sentence of speech to be synthesized, and normalizing the power of the VCV segment. 

9. An apparatus for reading out a feature parameter and a driver sound source registered in a VCV {vowel- 
consonant-vowel) speech segment file in accordance vynth a phoneme series of speech to be synthesized, 
sequentially connecting the readout parameter and the readout sound source information in accordance with 
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a predetermined rule, and supplying the connected data to a speech synthesizer, thereby generating a 
speech output, comprising: 

means for setting a vowel constant period in accordance with an utterance speed of synthesized speech; 
and 

connecting means for expanding/reducing and connecting speech parameters in accordance witii ttie period 
set by said setting means. 

10- A speech rule synthesis method of expanding/reducing a feature parameter and a driver sound source 
registered in a speech segment file by using a phoneme series of ^Deech to be synthesized, in accordance 
with an utterance speed of synthesized speech, sequentially connecting the expanded/reduced parameter 
and the expanded/reduced driver sound source, and supplying the connected data to a speech synthesizer, 
thereby outputting synthesized speech, comprising the steps of: 

setting a vowel constant period in units of vowefs in accordance with an utterance speed of the synthesized 
speech; and 

expanding/redudng and connecting the speech parameters in acoonjance with the set period. 

11. A speech synthesis apparatus comprising: memory means for storing speech parameters by using a 
VCV segment as a basic unit; 

first setting means for setting a syllable beat point pitch in accordance witii an utterance speed of 
synthesized speech; 

second setting means for setting an expansion/reduction rate of the speech parameters In accordance with 
types of the VCV segments; 

connecting means for expanding/reducing and connecting the speech parameters in accordance with the 
expansion/reduction rate set by said second setting means; and 

synthesizing means for performing speech synthesis by using a normalizatiOTi orthogonal filter and an 
e)q)onentiat approximation filter. 

12. An apparatus according to claim 11, wherein the syllable beat point defines an utterance timing and is 
set so that a pitch between a given syllable beat point and the next syllable beat point has a predetermined 
value corresponding Id the utterance speed. 

13. An apparatus acconjing to claim 11, wherein Qie expansion/reduction rate of the speech parameters is 
determined in accordance with whether a VCV segment represented by the speech parameter tends to be 
changed in accordance with a change in utterance speed. 

14. An apparatus according to claim 11, wherein said nomnalization orthogonal filter and said exponential 
approximation filter used in said synthesizing means have characteristics in which a volume of information 
is increased in a low-frequency spectral range. 

15. A method of performing speech syntheas, comprising the steps of: 
storing speech parameters by using a VCV segment as a basic unit; 

setting an expansion/reduction rate of the speech parameter in accordance with a syllable beat point pitch 
con-esponding to an utterance speed of syntheazed speech and types of VCV segments; and 
synthesizing the speech parameter at the set expansion/redudjon rate by using a normalization orthogonal 
function and an exponential approximation function. 

16. A method according to claim 15, wh^-ein the syllable beat point defines an utterance timing and is set 
so that a pitch t>etween a given sylled^le beat point and the next syllable beat point has a predetenmined 
value corre^onding to the utterance speed. 

17. A method according to claim 15, wherein the expansion/reduction rate of the speech parameters is 
determined in accordance with whether a VCV segment represented by the speedi parameter tends to be 
changed in accordance with a change in utterance speed. 

18. A method according to claim 15. wherein the normalization orthogonal function and the exponential 
approximation function used in the syntiiesizing step have characteristics in which a volume of information 

is increased in a low-frequency spectral range. 

19. A speech synthesis apparatus comprising: 

memory means for storing a pair of a speech parameter using a VCV segment as a basic unit and 
expansion/reduction information of the speech parameter and 

connecting means for expanding/reducing the speech parameter in accordance with the expansion/reduction 
information. 

20. A method of performing speech synthesis, comprising the steps of: 

prestoring a pair of a speech parameter using a VCV segment as a basic unit and expansion/reduction 
information of the speech parameter; and 

expanding/reducing and connecting the speech parameters in accordance with the expansion/reduction 
information paired witii the speech parameter when the speech parameters are to be synthesized. 
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21 . A speech synthesis apparatus for controlling a driver sound source by speech parameters to perform 
speech synthesis, comprising synthesizing means for obtaining a voiceless vowel by using a parameter of a 
consonant immediately preceding the voiceless vowel as a parameter of the voiceless vowe) and a 
consonant sound source as said drive sound source. 
5 22. An apparatus according to claim 21 , further comprising means for determining whether a vowel to be 
synthesized is to be a voiceless vowel on the basis of prestored voiceless vowel conditions. 
23. An apparatus according to claim 21, wherein said synthesizing means obtains the parameter of the 
voiceless vowel by expanding the parameter of the consonant immediately preceding the voiceless vowel to 
a vowel period. 

10 24. An apparatus according to claim 21 , wherein said synthesizing means obtains the parameter of the 
voiceless vowel by copying the parameter of the consonant immediately preceding the voiceless vowel to a 
vowel period. 

25. A method of controlling a driver sound source by speech parameters to perform speech synthesis, 
comprising the step of synthesizing a voiceless vowel by using a parameter of a consonant immediately 

15 preceding the voiceless vowel as a parameter of the voiceless vowel and a consonant sound source as said 
driver sound source. 

26. A method according to claim 25, wherein the voiceless vowel is detemiined on the basis of 
predetermined voiceless vowel conditions. 

27. A method according to claim 25, wherein the parameter of the voiceless vowel Is obtained by expanding 
ao the parameter of the immediately preceding consonant to a vowel period. 

28. A method according to claim 25, wherein the parameter of the voiceless vowel is obtained by copying 
the parameter of the immediately preceding consonant to a vowel period. 

29. A speech synthesis apparatus comprising: sound source generating means for generating a sound 
source corresponding to an input character train; 

25 means for inverting a speech segment parameter series corresponding to the input character train along a 
time axis; 

connecting means for obtaining a speech parameter series by connecting the speech segments; 

means for synthesizing the speech parameter series and an output from said sound source generating 

means; and 

30 output means for outputting a speech signal. 

30. An apparatus according to claim 29, further comprising means for detenmining whether the speech 
segment parameter series is to be inverted along the time axis. 

31. An apparatus according to claim 29, further comprising means for converting the time axis of the 
parameter series linearly or nonlinearly when the speech segment parameter series is inverted along the 

35 time axis. 

32. A speech synthesis method comprising the steps of: 
generating a sound source corresponding to an input character train; 

inverting a paranneter series of speech segments corresponding to the input character train along a time 
axis; 

40 connecting the inverted parameter series to obtain a speech parameter series: 

synthesizing the speech parameter series with an output from sound source generating means: and 
outputting speech. 

33. A method according to claim 32, further comprising the step of determining whether the parameter 
series of the speech segments is to be inverted along the time axis. 

45 34. A method according to claim 32, further comprising the step of converting the time axis of the 
parameter series linearly or nonlinearly when the speech segment parameter series is inverted along the 
time axis. 

35. An apparatus for controlling a driver sound source by speech parameters to perform speech synthesis, 
comprising: 

50 means for inputting a text for synthesized speech and control information for controllingi an accent and an 
utterance speed of the synthesized speech; and 

synthesizing means for synthesizing speech represented by said text in accordance with the control 
information input from said input means. 

36. A method of controlling a driver sound source by speech parameters to perfomi speech synthesis, 
55 comprising the step of performing the speech synthesis in accordance with control information for 

controlling an accent and an utterar :e speed for the synthesized speech, the accent and the utterance 
speed being input together with a text for the synthesized speech. 

37. A text to speech synthesizer comprising means for varying the amplitude and means for varying tfie 
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pitch of synthesized speech, characterised in that the means for varying the anpiltude and the means for 
varying the pitch are controlied to vary the two together. 

38. A speech synthesizer capable of articulating speech at different rates in dependerice upon a rate 
parameter, characterised in that it is arranged to vary the lengths of vowel parts of speech and consonant 

5 parts of speech differently. 

39. A synthesizer according to claim 38, in which, for a given change of rate, the variation of the consonant 
length, if any, is in the opposite direction to the variation of vowel length, if any. 

40. A method of setting the rate of speech synthesis by varying the length of synthesized vowels by an 
amount greater than the variation of synthesized consonants. 

70 41 . A speech synthesizer having alternative excitation sources for voiced and unvoiced speech, comprising 
means for selecting one of said excitation sources, characterised in that, if a vowel is to be synthesized 
employing the excitation source, an immediately preceding period of the preceding consonant is syn- 
thesized at a relatively tow amplitude employing said unvoiced source. 

42. A speech synthesizer including conversion means for generating, for a corresponding predetermined 
75 pari of speech, a sequence of synthesizer parameters, said conversion means including a store holding a 
sequence of parameters corresponding to each said part of speech, characterised in that the conversion 
means includes time conversion means, and is arranged to generate parameters for given parts of speech 
by time converting the parameters stored for other parts of speech which can be mapped thereto along the 
time axis. 

20 43. A synthesizer according to claim 42 in which the time conversion means is arranged to read out 
parameter data bacltwards from the store. 
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^ (g) Disclosed is a method and app^atus for reading 
^ out a feature parameter and a driver sound source 
stored in a VCV (vowel-consonant-vowel) speech 
^ segment file, sequentially connecting the readout 
^ parameter and the readtxit sound source information 
in accordance witti a predetermined rule, and sup- 
plying connected data to a speech synthesizer, 

a. 



thereby generating a speech output including a 
memory for storing an average power of each vowel, 
and a power controller for controlling to normalize 
the VCV segment so that powers at both ends of 
each VCV segment coincide with the average power 
of each vowel. 
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